The used-car sales service «Не бит, не крашен» needs an app to attract new customers: in it, users will be able to find out the market value of their car. We will build a model that can determine this value from data on the cars' technical specifications and parameters.
import lightgbm as lgb
import numpy as np
import pandas as pd
import seaborn as sns
import ydata_profiling
from datetime import datetime, date
from matplotlib import pyplot as plt
from catboost import CatBoostRegressor, Pool
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import make_column_transformer
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold, RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.tree import DecisionTreeRegressor
from time import perf_counter
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
Let's take a look at the available data.
try:
    df = pd.read_csv('datasets/autos.csv')
except FileNotFoundError:
    df = pd.read_csv('/datasets/autos.csv')
df.head()
| | DateCrawled | Price | VehicleType | RegistrationYear | Gearbox | Power | Model | Kilometer | RegistrationMonth | FuelType | Brand | Repaired | DateCreated | NumberOfPictures | PostalCode | LastSeen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-03-24 11:52:17 | 480 | NaN | 1993 | manual | 0 | golf | 150000 | 0 | petrol | volkswagen | NaN | 2016-03-24 00:00:00 | 0 | 70435 | 2016-04-07 03:16:57 |
| 1 | 2016-03-24 10:58:45 | 18300 | coupe | 2011 | manual | 190 | NaN | 125000 | 5 | gasoline | audi | yes | 2016-03-24 00:00:00 | 0 | 66954 | 2016-04-07 01:46:50 |
| 2 | 2016-03-14 12:52:21 | 9800 | suv | 2004 | auto | 163 | grand | 125000 | 8 | gasoline | jeep | NaN | 2016-03-14 00:00:00 | 0 | 90480 | 2016-04-05 12:47:46 |
| 3 | 2016-03-17 16:54:04 | 1500 | small | 2001 | manual | 75 | golf | 150000 | 6 | petrol | volkswagen | no | 2016-03-17 00:00:00 | 0 | 91074 | 2016-03-17 17:40:17 |
| 4 | 2016-03-31 17:25:20 | 3600 | small | 2008 | manual | 69 | fabia | 90000 | 7 | gasoline | skoda | no | 2016-03-31 00:00:00 | 0 | 60437 | 2016-04-06 10:17:21 |
ydata_profiling.ProfileReport(df)
The main problems with the current dataset are visible in the profiling report above.
The dataset's correlation heatmap shows that the target correlates with the car's power and year of manufacture, and is inversely correlated with mileage, which suggests the data are broadly plausible.
df = df.drop_duplicates(keep='first')
df['DateCrawled'] = pd.to_datetime(df['DateCrawled'])
df.DateCrawled.max().year
2016
df['DateCreated'] = pd.to_datetime(df['DateCreated'])
df.DateCreated.max().year
2016
The data in the dataset are capped at 2016; we will take this into account in the rest of the work.
Let's get rid of uninformative features.
df = df.drop(['DateCrawled', 'DateCreated', 'PostalCode', 'LastSeen', 'NumberOfPictures'], axis=1)
The 'VehicleType', 'Gearbox' and 'FuelType' columns contain a significant number of missing values. Some models come in only one configuration for these attributes; let's find such models and fill their gaps with the known values.
criteria = ['VehicleType', 'Gearbox', 'FuelType']
for criterion in criteria:
    for vehicle in df.Model.unique():
        if len(df.loc[(df.Model == vehicle) & (df[criterion].notna()), criterion].unique()) == 1:
            df.loc[(df.Model == vehicle) & (df[criterion].isna()), criterion] = \
                df.loc[(df.Model == vehicle) & (df[criterion].notna()), criterion].unique()[0]
df.isna().sum()
Price                    0
VehicleType          37428
RegistrationYear         0
Gearbox              19755
Power                    0
Model                19705
Kilometer                0
RegistrationMonth        0
FuelType             32737
Brand                    0
Repaired             71154
dtype: int64
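The nested loop above filters the whole frame once per (criterion, model) pair. The same single-known-value fill can be sketched in vectorized form with `groupby().transform()`; `fill_single_valued` below is a hypothetical helper, shown on a toy frame rather than the notebook's df:

```python
import pandas as pd

def fill_single_valued(df, group_col, target_col):
    """Fill NaNs in target_col only for groups with exactly one known value."""
    def fill(series):
        known = series.dropna().unique()
        # broadcast the single known value over the gaps; leave ambiguous groups alone
        return series.fillna(known[0]) if len(known) == 1 else series
    df[target_col] = df.groupby(group_col)[target_col].transform(fill)
    return df

cars = pd.DataFrame({
    'Model':   ['golf', 'golf', 'fabia', 'fabia'],
    'Gearbox': ['manual', None, 'manual', 'auto'],
})
cars = fill_single_valued(cars, 'Model', 'Gearbox')
print(cars['Gearbox'].tolist())  # → ['manual', 'manual', 'manual', 'auto']
```

The 'fabia' group stays untouched because it has two known gearbox values, so the gap would be ambiguous.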
There turned out to be few such fillable gaps; let's fill the remaining ones with the string 'unspecified'.
criteria = ['VehicleType', 'Gearbox', 'FuelType', 'Model', 'Repaired']
for criterion in criteria:
    df.loc[df[criterion].isna(), criterion] = 'unspecified'
df.isna().sum()
Price                0
VehicleType          0
RegistrationYear     0
Gearbox              0
Power                0
Model                0
Kilometer            0
RegistrationMonth    0
FuelType             0
Brand                0
Repaired             0
dtype: int64
df.FuelType.unique()
array(['petrol', 'gasoline', 'unspecified', 'lpg', 'other', 'hybrid',
'cng', 'electric'], dtype=object)
The 'FuelType' column contains implicit duplicate values; let's merge them.
df.loc[df.FuelType == 'gasoline', 'FuelType'] = 'petrol'
df.loc[df.FuelType == 'cng', 'FuelType'] = 'lpg'
df.loc[df.FuelType == 'other', 'FuelType'] = 'unspecified'
df.FuelType.unique()
array(['petrol', 'unspecified', 'lpg', 'hybrid', 'electric'], dtype=object)
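The three `.loc` assignments above can be collapsed into a single `replace()` call with a mapping; a minimal sketch on a standalone Series:

```python
import pandas as pd

# same mapping as the three .loc assignments above
fuel_map = {'gasoline': 'petrol', 'cng': 'lpg', 'other': 'unspecified'}
fuel = pd.Series(['petrol', 'gasoline', 'cng', 'other', 'hybrid'])
fuel = fuel.replace(fuel_map)  # unmapped labels pass through unchanged
print(fuel.tolist())  # → ['petrol', 'petrol', 'lpg', 'unspecified', 'hybrid']
```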
df[df.RegistrationYear > 2016]
| | Price | VehicleType | RegistrationYear | Gearbox | Power | Model | Kilometer | RegistrationMonth | FuelType | Brand | Repaired |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 22 | 2900 | unspecified | 2018 | manual | 90 | meriva | 150000 | 5 | petrol | opel | no |
| 26 | 5555 | unspecified | 2017 | manual | 125 | c4 | 125000 | 4 | unspecified | citroen | no |
| 48 | 7750 | unspecified | 2017 | manual | 80 | golf | 100000 | 1 | petrol | volkswagen | unspecified |
| 51 | 2000 | unspecified | 2017 | manual | 90 | punto | 150000 | 11 | petrol | fiat | yes |
| 57 | 2399 | unspecified | 2018 | manual | 64 | other | 125000 | 3 | unspecified | seat | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 354112 | 3600 | unspecified | 2017 | manual | 86 | transit | 150000 | 5 | petrol | ford | unspecified |
| 354140 | 1000 | unspecified | 2017 | manual | 101 | a4 | 150000 | 9 | unspecified | audi | unspecified |
| 354203 | 2140 | unspecified | 2018 | manual | 80 | fiesta | 150000 | 6 | unspecified | ford | no |
| 354253 | 1250 | unspecified | 2018 | unspecified | 0 | corsa | 150000 | 0 | petrol | opel | unspecified |
| 354289 | 5000 | unspecified | 2017 | manual | 120 | other | 150000 | 7 | unspecified | citroen | yes |
14530 rows × 11 columns
The 'RegistrationYear' column has many anomalous values. Let's drop rows with years above 2016, as well as statistical outliers (pricing classic cars is a separate task that requires an individual approach to each vehicle).
df = df[df.RegistrationYear <= 2016]
def clear_data(data, criterion):
    IQR = data[criterion].quantile(0.75) - data[criterion].quantile(0.25)
    bottom_threshold = data[criterion].quantile(0.25) - 1.5 * IQR
    if bottom_threshold < 0:
        bottom_threshold = 0
    top_threshold = data[criterion].quantile(0.75) + 1.5 * IQR
    return data[(data[criterion] > bottom_threshold) & (data[criterion] < top_threshold)]
df = clear_data(df, 'RegistrationYear')
df.boxplot('RegistrationYear');
df.boxplot('Power');
Let's also remove statistical outliers and zero values in the 'Power' column by replacing them with per-model medians.
%%time
for vehicle in df.Model.unique():
    top_threshold = df.loc[df.Model == vehicle, 'Power'].quantile(0.75) + 1.5 *\
        (df.loc[df.Model == vehicle, 'Power'].quantile(0.75) - df.loc[df.Model == vehicle, 'Power'].quantile(0.25))
    df.loc[(df.Model == vehicle) & (df['Power'] > top_threshold), 'Power'] = df.loc[df.Model == vehicle, 'Power'].median()
    df.loc[(df.Model == vehicle) & (df['Power'] == 0), 'Power'] = df.loc[df.Model == vehicle, 'Power'].median()
CPU times: user 27.7 s, sys: 0 ns, total: 27.7 s Wall time: 27.7 s
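The per-model loop above takes about 28 s because it re-filters the whole frame for every model. A roughly equivalent vectorized sketch (toy data; note it computes each model's quartiles and median once, whereas the loop recomputes the median after capping):

```python
import pandas as pd

cars = pd.DataFrame({
    'Model': ['golf'] * 5,
    'Power': [75.0, 80.0, 85.0, 0.0, 900.0],  # one zero, one outlier
})

# per-model statistics, broadcast back to row shape in one pass
grp = cars.groupby('Model')['Power']
q1 = grp.transform(lambda s: s.quantile(0.25))
q3 = grp.transform(lambda s: s.quantile(0.75))
median = grp.transform('median')
top = q3 + 1.5 * (q3 - q1)

# replace outliers and zeros with the model's median power
bad = (cars['Power'] > top) | (cars['Power'] == 0)
cars.loc[bad, 'Power'] = median[bad]
print(cars['Power'].tolist())  # → [75.0, 80.0, 85.0, 80.0, 80.0]
```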
df.boxplot('Power');
df[df.Power == 0]
| | Price | VehicleType | RegistrationYear | Gearbox | Power | Model | Kilometer | RegistrationMonth | FuelType | Brand | Repaired |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3338 | 199 | unspecified | 1995 | manual | 0.0 | samara | 100000 | 0 | petrol | lada | unspecified |
| 26625 | 200 | unspecified | 2016 | manual | 0.0 | samara | 150000 | 0 | petrol | lada | unspecified |
| 42665 | 300 | small | 1997 | manual | 0.0 | samara | 50000 | 3 | petrol | lada | yes |
| 45401 | 0 | wagon | 2007 | manual | 0.0 | samara | 125000 | 8 | petrol | lada | unspecified |
| 69358 | 99 | unspecified | 1995 | manual | 0.0 | samara | 125000 | 0 | petrol | lada | unspecified |
| 150050 | 650 | wagon | 2007 | manual | 0.0 | samara | 125000 | 8 | petrol | lada | unspecified |
| 353555 | 2199 | unspecified | 2016 | manual | 0.0 | samara | 70000 | 0 | unspecified | lada | no |
A few zero values still remain after the median replacement (for these models the median power is itself zero), so let's drop these rows from the dataset.
df = df[df.Power != 0]
Let's remove zero values and possible anomalies in the target feature the same way, replacing them with per-model medians.
df.Price.hist(bins=100);
%%time
for vehicle in df.Model.unique():
    top_threshold = df.loc[df.Model == vehicle, 'Price'].quantile(0.75) + 1.5 *\
        (df.loc[df.Model == vehicle, 'Price'].quantile(0.75) - df.loc[df.Model == vehicle, 'Price'].quantile(0.25))
    df.loc[(df.Model == vehicle) & (df['Price'] > top_threshold), 'Price'] = df.loc[df.Model == vehicle, 'Price'].median()
    df.loc[(df.Model == vehicle) & (df['Price'] == 0), 'Price'] = df.loc[df.Model == vehicle, 'Price'].median()
CPU times: user 28 s, sys: 5.76 ms, total: 28 s Wall time: 28 s
df[df.Price == 0]
| Price | VehicleType | RegistrationYear | Gearbox | Power | Model | Kilometer | RegistrationMonth | FuelType | Brand | Repaired |
|---|---|---|---|---|---|---|---|---|---|---|
df[df.Price < 50]['Price'].hist(bins=100);
# very low prices are likely placeholder listings; trim the bottom 0.5% of the target
df = df[df.Price > df.Price.quantile(0.005)]
df.RegistrationMonth.unique()
array([ 0, 5, 8, 6, 7, 10, 12, 11, 2, 3, 1, 4, 9])
The 'RegistrationMonth' column contains zero months. Since we will later compute the car's age, let's replace these zeros with the middle month of the year (6).
df.loc[df.RegistrationMonth == 0, 'RegistrationMonth'] = 6
df['RegistrationMonth'].hist();
Let's convert 'RegistrationYear' and 'RegistrationMonth' into the car's age and store it in a separate column.
df['Registration'] = df['RegistrationYear'].astype('str') + '-' + df['RegistrationMonth'].astype('str')
# '%m' parses the month; '%M' would parse minutes and silently discard the month
df['Registration'] = pd.to_datetime(df['Registration'], format='%Y-%m')
df['CarAge'] = ((datetime.now() - df['Registration']).dt.days // 365).astype('float')
df = df.drop(['Registration', 'RegistrationYear', 'RegistrationMonth'], axis=1)
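Since the listings end in 2016, measuring age from `datetime.now()` inflates every age by the years elapsed since the crawl; the shift is uniform, so model ranking is unaffected, but the absolute ages stop being meaningful. A sketch anchored to an assumed crawl cutoff of 2016-04-01 (the exact cutoff date is an assumption, not taken from the data):

```python
import pandas as pd

REFERENCE_DATE = pd.Timestamp('2016-04-01')  # assumed crawl cutoff

# assemble registration dates from year/month columns, as in the notebook
reg = pd.to_datetime(pd.DataFrame({'year': [1993, 2011], 'month': [6, 5], 'day': 1}))
car_age = ((REFERENCE_DATE - reg).dt.days / 365.25).round().astype(int)
print(car_age.tolist())  # → [23, 5]
```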
df.head()
| | Price | VehicleType | Gearbox | Power | Model | Kilometer | FuelType | Brand | Repaired | CarAge |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 480.0 | unspecified | manual | 98.0 | golf | 150000 | petrol | volkswagen | unspecified | 30.0 |
| 1 | 1250.0 | coupe | manual | 190.0 | unspecified | 125000 | petrol | audi | yes | 12.0 |
| 2 | 9800.0 | suv | auto | 163.0 | grand | 125000 | petrol | jeep | unspecified | 19.0 |
| 3 | 1500.0 | small | manual | 75.0 | golf | 150000 | petrol | volkswagen | no | 22.0 |
| 4 | 3600.0 | small | manual | 69.0 | fabia | 90000 | petrol | skoda | no | 15.0 |
df.shape
(330669, 10)
ydata_profiling.ProfileReport(df)
After these replacements and transformations, 9.9% of the rows in the dataset became duplicates; let's remove them.
df = df.drop_duplicates(keep='first')
Let's train several models and compare their validity and efficiency.
To reduce training time, we set the number of cross-validation splits of the training set to 3 (GridSearchCV and RandomizedSearchCV default to 5).
crossval = KFold(n_splits=3, shuffle=True)
Let's separate the target from the features.
features = df.drop('Price', axis=1)
target = df['Price']
Let's convert the columns holding categorical features from object to the category dtype.
for col in features.columns:
    if features[col].dtype == 'object':
        features[col] = features[col].astype('category')
Let's split the data into training and test sets.
features_train, features_test, target_train, target_test = \
train_test_split(features, target, test_size=0.2)
for frame in [features_train, features_test, target_train, target_test]:
    print(frame.shape)
(223481, 9)
(55871, 9)
(223481,)
(55871,)
ohe_features = features.select_dtypes(include='category').columns.to_list()
ohe_features
['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'Repaired']
num_features = features.select_dtypes(exclude='category').columns.to_list()
num_features
['Power', 'Kilometer', 'CarAge']
features_train.columns
Index(['VehicleType', 'Gearbox', 'Power', 'Model', 'Kilometer', 'FuelType',
'Brand', 'Repaired', 'CarAge'],
dtype='object')
Let's build the model pipeline and the parameter grid to search with GridSearchCV.
col_transformer = make_column_transformer(
(
OneHotEncoder(drop='first', handle_unknown='ignore'),
ohe_features
),
(
StandardScaler(),
num_features
),
#remainder='passthrough',
verbose_feature_names_out=False
)
ridge_pipe = make_pipeline(col_transformer, Ridge())
ridge_params = {
'ridge__alpha': np.logspace(-3, 3, 14),
}
ridge_grid = GridSearchCV(
    ridge_pipe,
    param_grid=ridge_params,
    cv=crossval,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    verbose=False
)
%%time
ridge_grid.fit(features_train, target_train)
/home/romauov/anaconda3/envs/env39/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py:182: UserWarning: Found unknown categories in columns [2] during transform. These unknown categories will be encoded as all zeros (warning repeated for each CV fit)
CPU times: user 8.9 s, sys: 8.3 s, total: 17.2 s Wall time: 34.1 s
GridSearchCV(estimator=Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('onehotencoder',
OneHotEncoder(drop='first',
handle_unknown='ignore'),
['VehicleType',
'Gearbox',
'Model',
'FuelType',
'Brand',
'Repaired']),
('standardscaler',
StandardScaler(),
['Power',
'Kilometer',
'CarAge'])],
verbose_feature_names_out=False)),
('ridge', Ridge())]),
n_jobs=-1,
param_grid={'ridge__alpha': array([1.00000000e-03, 2.89426612e-03, 8.37677640e-03, 2.42446202e-02,
7.01703829e-02, 2.03091762e-01, 5.87801607e-01, 1.70125428e+00,
4.92388263e+00, 1.42510267e+01, 4.12462638e+01, 1.19377664e+02,
3.45510729e+02, 1.00000000e+03])},
             scoring='neg_root_mean_squared_error', verbose=False)
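After fitting, the tuned hyperparameter and the cross-validated score can be read off the search object; because `scoring='neg_root_mean_squared_error'`, the score must be negated to recover RMSE. A self-contained miniature with synthetic data standing in for the notebook's `ridge_grid`:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# synthetic regression problem with small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

grid = GridSearchCV(Ridge(), {'alpha': [0.1, 1.0, 10.0]},
                    scoring='neg_root_mean_squared_error', cv=3)
grid.fit(X, y)

best_alpha = grid.best_params_['alpha']   # tuned hyperparameter
best_rmse = -grid.best_score_             # negate the score to recover RMSE
```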
Let's build the model pipeline and the parameter grid to search with GridSearchCV.
lasso_pipe = make_pipeline(col_transformer, Lasso())
lasso_params = [{
'lasso__alpha': [0.02, 0.03]
} ]
lasso_grid = GridSearchCV(lasso_pipe,
param_grid=lasso_params,
cv=crossval,
scoring='neg_root_mean_squared_error',
n_jobs=-1,
verbose=False)
Let's fit the model.
%%time
lasso_grid.fit(features_train, target_train);
/home/romauov/anaconda3/envs/env39/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py:182: UserWarning: Found unknown categories in columns [2] during transform. These unknown categories will be encoded as all zeros
/home/romauov/anaconda3/envs/env39/lib/python3.9/site-packages/sklearn/linear_model/_coordinate_descent.py:593: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. (warnings repeated for each CV fit)
CPU times: user 6min 17s, sys: 3min 54s, total: 10min 11s Wall time: 7min 25s
GridSearchCV(cv=KFold(n_splits=3, random_state=None, shuffle=True),
estimator=Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('onehotencoder',
OneHotEncoder(drop='first',
handle_unknown='ignore'),
['VehicleType',
'Gearbox',
'Model',
'FuelType',
'Brand',
'Repaired']),
('standardscaler',
StandardScaler(),
['Power',
'Kilometer',
'CarAge'])],
verbose_feature_names_out=False)),
('lasso', Lasso())]),
n_jobs=-1, param_grid=[{'lasso__alpha': [0.02, 0.03]}],
             scoring='neg_root_mean_squared_error', verbose=False)
Let's build the model pipeline and the parameter grid to search with GridSearchCV.
elasticnet = ElasticNet()
elastic_pipe = make_pipeline(col_transformer, elasticnet)
elastic_params = [
{'elasticnet__alpha': [.0004, .0005],
'elasticnet__l1_ratio': [0.5, 1]}
]
elastic_grid = GridSearchCV(elastic_pipe,
param_grid=elastic_params,
cv=crossval,
scoring='neg_root_mean_squared_error',
n_jobs=-1,
verbose=False)
Let's fit the model.
%%time
elastic_grid.fit(features_train, target_train)
/home/romauov/anaconda3/envs/env39/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py:182: UserWarning: Found unknown categories in columns [2] during transform. These unknown categories will be encoded as all zeros
/home/romauov/anaconda3/envs/env39/lib/python3.9/site-packages/sklearn/linear_model/_coordinate_descent.py:593: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. (warnings repeated for each CV fit)
CPU times: user 8min 14s, sys: 4min 49s, total: 13min 3s Wall time: 16min 38s
GridSearchCV(cv=KFold(n_splits=3, random_state=None, shuffle=True),
estimator=Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('onehotencoder',
OneHotEncoder(drop='first',
handle_unknown='ignore'),
['VehicleType',
'Gearbox',
'Model',
'FuelType',
'Brand',
'Repaired']),
('standardscaler',
StandardScaler(),
['Power',
'Kilometer',
'CarAge'])],
verbose_feature_names_out=False)),
('elasticnet', ElasticNet())]),
n_jobs=-1,
param_grid=[{'elasticnet__alpha': [0.0004, 0.0005],
'elasticnet__l1_ratio': [0.5, 1]}],
             scoring='neg_root_mean_squared_error', verbose=False)
Let's create the model and the parameter distributions for a randomized search with RandomizedSearchCV.
lgbm_model = lgb.LGBMRegressor(verbose=-1)
lgbm_params = {
'num_leaves': [31, 41, 51],
'min_child_samples': [5, 15],
'max_depth': [10, 20],
'learning_rate': np.logspace(-3, -1, 3),
'reg_alpha': np.logspace(-4, -2, 2)
}
lgbm_grid = RandomizedSearchCV(lgbm_model,
lgbm_params,
cv=crossval,
scoring='neg_root_mean_squared_error',
n_iter=10)
Let's fit the model.
%%time
lgbm_grid.fit(X=features_train, y=target_train)
CPU times: user 1min 29s, sys: 205 ms, total: 1min 30s Wall time: 24.2 s
RandomizedSearchCV(cv=KFold(n_splits=3, random_state=None, shuffle=True),
estimator=LGBMRegressor(verbose=-1),
param_distributions={'learning_rate': array([0.001, 0.01 , 0.1 ]),
'max_depth': [10, 20],
'min_child_samples': [5, 15],
'num_leaves': [31, 41, 51],
'reg_alpha': array([0.0001, 0.01 ])},
                   scoring='neg_root_mean_squared_error')
Let's plot the model's feature importances.
lgb.plot_importance(lgbm_grid.best_estimator_, height=.5);
Let's create the model and the parameter distributions for a randomized search with RandomizedSearchCV.
cat_features = ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'Repaired']
cat_model = CatBoostRegressor(loss_function='RMSE',
silent=True,
cat_features=cat_features)
cat_params = {
'learning_rate': np.logspace(-3, 0, 5),
'iterations': [40, 60],
'depth': [d for d in range(2, 11)],
}
cat_grid = RandomizedSearchCV(cat_model,
cat_params,
cv=crossval,
scoring='neg_root_mean_squared_error',
n_iter=10)
Let's fit the model.
%%time
cat_grid.fit(features_train, target_train);
CPU times: user 3min 25s, sys: 9.15 s, total: 3min 35s Wall time: 44.3 s
RandomizedSearchCV(cv=KFold(n_splits=3, random_state=None, shuffle=True),
estimator=<catboost.core.CatBoostRegressor object at 0x7f8f5b3a76d0>,
param_distributions={'depth': [2, 3, 4, 5, 6, 7, 8, 9, 10],
'iterations': [40, 60],
'learning_rate': array([0.001 , 0.00562341, 0.03162278, 0.17782794, 1. ])},
                   scoring='neg_root_mean_squared_error')
Let's plot the model's feature importances.
cat_imp = cat_grid.best_estimator_.feature_importances_
sorted_idx = np.argsort(cat_imp)
fig = plt.figure(figsize=(12, 6))
plt.barh(range(len(sorted_idx)), cat_imp[sorted_idx], align='center')
plt.yticks(range(len(sorted_idx)), np.array(features.columns)[sorted_idx])
plt.title('Feature Importance');
As the plots show, the LightGBM and CatBoost models rank the influence of the features on the target differently.
Let's gather the models' results into a single table.
grids = [ridge_grid, lasso_grid, elastic_grid, lgbm_grid, cat_grid]
models_data = []
for grid in grids:
    # index of the best parameter combination
    best_id = np.argmin(grid.cv_results_['rank_test_score'])
    models_data.append(
        [
            # best cross-validated RMSE on the training set
            (-grid.cv_results_['mean_test_score'][best_id]).round(3),
            # model fit time
            grid.cv_results_['mean_fit_time'][best_id].round(3),
            # prediction (scoring) time
            grid.cv_results_['mean_score_time'][best_id].round(3)
        ]
    )
models_df = pd.DataFrame(data=models_data,
                         index=['Ridge', 'Lasso', 'ElasticNet', 'LightGBM', 'CatBoost'],
                         columns=['best CV RMSE', 'fit time, s', 'prediction time, s']
                        )
display(models_df)
| | best CV RMSE | fit time, s | prediction time, s |
|---|---|---|---|
| Ridge | 2403.425 | 4.437 | 0.198 |
| Lasso | 2403.366 | 211.918 | 0.166 |
| ElasticNet | 2404.196 | 279.357 | 0.284 |
| LightGBM | 1670.165 | 0.577 | 0.188 |
| CatBoost | 1779.211 | 2.202 | 0.037 |
The best result came from the LightGBM model, which also trained the fastest. Let's check this model on the test set:
(mean_squared_error(target_test, lgbm_grid.best_estimator_.predict(features_test)) ** 0.5).round(1)
1764.0
The result meets the target level.
lgbm_grid.best_params_
{'reg_alpha': 0.01,
'num_leaves': 51,
'min_child_samples': 5,
'max_depth': 20,
'learning_rate': 0.1}